Exploring Red Wine Quality by Lennart Telwest

Introduction

About the dataset

This analysis will explore a dataset on wine quality and physicochemical properties. The objective is to explore which chemical properties influence the quality of red wines.

The dataset contains red variants of the Portuguese “Vinho Verde” wine. Only physicochemical (inputs) and sensory (the output) variables are available.

Description of attributes:

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

  5. chlorides: the amount of salt in the wine

  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol: the percent alcohol content of the wine

  12. quality (score between 0 and 10)

This description and more background information can be found here.

This analysis is based on: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. The dataset is available here.

Summary of the dataset

Let’s get a first glimpse of the available variables and their distribution by printing the five-number summary extended by the mean.

str(df)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(df)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
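
The same five-number summary plus mean can also be collected per column into one table. The following is a minimal sketch, not the original analysis code: the data frame name `wine_head` is hypothetical, and it holds only the first ten values of three columns as printed by str(df) above.

```r
# First ten rows of three columns, copied from the str(df) output above.
# The name `wine_head` is hypothetical; the real analysis uses the full `df`.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# summary() on a numeric vector returns Min., 1st Qu., Median, Mean,
# 3rd Qu. and Max.; sapply() collects one such column per variable.
column_summaries <- sapply(wine_head, summary)
print(column_summaries)
```

Comparing the Mean and Median rows of such a table is a quick first check for skew: a mean clearly above the median points to a right-skewed distribution.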

Univariate Plots Section

Fixed Acidity

Definition: most acids involved with wine are fixed or nonvolatile (do not evaporate readily). The fixed acidity is a right-skewed distribution, with a median of 7.9 g tartaric acid per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Volatile Acidity

Definition: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. The volatile acidity is an almost symmetric distribution with a few positive outliers. The median and mean are ~0.52 g acetic acid per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Citric Acid

Definition: found in small quantities, citric acid can add ‘freshness’ and flavor to wines. The citric acid content follows a roughly linearly decreasing distribution with a median of ~0.26 g citric acid per liter. The largest single group of wines (>150) does not contain any citric acid; since the first quartile is 0.09, this is the most common value rather than the majority. This distribution should be investigated for correlations later on!

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Residual Sugar

Definition: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. Half of the red wines in the dataset contain between 1.9 and 2.6 grams sugar per liter, while there is a long tail of wines that contain up to a maximum of 15.5 grams sugar per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Chlorides

Definition: the amount of salt in the wine. Most wines contain between 0.07 and 0.09 g of salt per liter. Again there is a long tail with wines containing up to 0.6 g of salt per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Free Sulfur Dioxide

Definition: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. In the dataset, free sulfur dioxide follows a right-skewed distribution, with an average of 16 mg per liter and 50% of the wines containing 7-21 mg per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Total sulfur dioxide

Definition: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

The red wines contain at least 6 mg per liter and most wines (75%) contain no more than 62 mg/l. The distribution is decreasing, with some very extreme outliers with more than 250 mg/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Density

Definition: the density of wine is close to that of water, depending on the percent alcohol and sugar content.

The density is approximately normally distributed, with a mean and median of ~1 g/cm³.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

pH Value

Definition: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

The pH value is normally distributed, with a few positive and negative outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Sulphates

Definition: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

Sulphates follow a right-skewed distribution. Most wines contain between 0.3 and 0.7 g/l, with some outliers ranging up to 2 g/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Alcohol

Definition: the percent alcohol content of the wine

Most wines contain around 10% alcohol. The distribution is skewed, with more wines containing more than 10% than less.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Quality

Definition: quality is the median of a score between 0 and 10 given by wine experts.

The experts mostly gave a rating of 5 or 6. Even though there are more wines with a rating of 6 and higher than with 5 and lower, the plot looks roughly normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

As seen in the frequency diagram above, comparatively few wines received a rating <5 or >6, so I’ll bin the scores into the categories bad, OK, good and very good to get rid of the very few 3 & 8 ratings.

##       bad        OK      good very good 
##        63       681       638       217
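
One way to do such binning in R is cut(). The sketch below is an illustration under assumptions, not the original code: the cut points (up to 4 is bad, 5 is OK, 6 is good, 7 and above is very good) are an assumed mapping that fits the category names, and the vector holds only the first ten quality scores printed by str(df) above.

```r
# First ten quality scores, copied from the str(df) output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed cut points (not stated in the text):
# [0,4] bad, (4,5] OK, (5,6] good, (6,10] very good.
rating <- cut(
  quality,
  breaks = c(0, 4, 5, 6, 10),
  labels = c("bad", "OK", "good", "very good"),
  include.lowest = TRUE,   # keep a score of 0 inside the first bin
  ordered_result = TRUE    # ratings have a natural order
)

print(table(rating))
```

Making the result an ordered factor keeps the categories in a sensible order on plot axes and in facets.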

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of 13 numeric variables. X is the row ID, which leaves 11 input variables and 1 output variable in total. The output variable quality is categorical, based on the median of at least 3 evaluations made by wine experts. Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). As seen in the summary above, the output (median) quality of those ratings ranged only from 3 to 8, with a mean of 5.6 and a median of 6.

The main feature of interest is the quality, which is what matters in the end when buying and drinking wine. The most interesting variables are those that are not normally distributed: citric acid, total sulfur dioxide and alcohol.

One new variable has been introduced: the rating. As the distribution of the quality consists mostly of 5’s and 6’s with some scatter below and above, the rating acts as a summary metric. By mapping the quality onto the 4 rating categories bad, OK, good & very good, it can be used in further plots to reduce scatter without losing too much detail.

Bivariate Plots Section

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
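
This warning is what ggplot2 emits when a boxplot is given a numeric x aesthetic without a grouping. A hedged sketch of one common remedy follows: convert quality to a factor before plotting. The data frame `wine_head` and the column `quality.factor` are hypothetical names, the values are the first ten rows printed by str(df), and the plot is only built when ggplot2 is installed.

```r
# First ten rows of two columns, copied from the str(df) output.
# `wine_head` is a hypothetical stand-in for the full data frame.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Treat the score as a discrete variable so each score gets its own box.
wine_head$quality.factor <- factor(wine_head$quality)

# Guarded so the sketch still runs where ggplot2 is not available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  alcohol_boxplot <- ggplot(wine_head, aes(x = quality.factor, y = alcohol)) +
    geom_boxplot() +
    labs(x = "quality", y = "alcohol (%)")
  # print(alcohol_boxplot) renders the plot
}
```

An alternative that keeps quality numeric is to add group = quality inside aes(), as the warning itself suggests.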

The boxplots already hint towards some correlations:

Wines of higher quality have:

The other variables do not seem to have an impact on the quality of the wine.

Further investigation of the correlation of the variables against quality should be done using cor.test:

##    fixed.acidity volatile.acidity      citric.acid          density 
##       0.12405165      -0.39055778       0.22637251      -0.17491923 
##               pH  log10.sulphates          alcohol 
##      -0.05773139       0.30864193       0.47616632
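
A vector of coefficients like the one above could be produced along the following lines. This is a minimal sketch under assumptions: the helper `cor_with_quality` and the data frame `wine_head` are hypothetical names, the data are only the first ten rows printed by str(df), and Pearson's method is assumed. The log10 transform of sulphates mirrors the log10.sulphates label in the output.

```r
# First ten rows of four columns, copied from the str(df) output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: Pearson correlation estimate from cor.test().
cor_with_quality <- function(x, quality) {
  unname(cor.test(x, quality, method = "pearson")$estimate)
}

correlations <- c(
  volatile.acidity = cor_with_quality(wine_head$volatile.acidity, wine_head$quality),
  log10.sulphates  = cor_with_quality(log10(wine_head$sulphates), wine_head$quality),
  alcohol          = cor_with_quality(wine_head$alcohol, wine_head$quality)
)

print(round(correlations, 3))
```

With only ten rows the numbers will differ from the full-data values above; cor.test() additionally reports a p-value and confidence interval, which help judge whether a weak coefficient is distinguishable from zero.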

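The following variables have correlations to wine quality: alcohol (0.48), volatile acidity (-0.39), sulphates on a log10 scale (0.31), citric acid (0.23), density (-0.17) and fixed acidity (0.12).

The pH-value does not correlate with wine quality (-0.06).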

Let’s see how these variables compare, plotted against each other and faceted by wine rating:

Most of the plots are nearly uniformly-distributed, but some observations could be made and will be explained in the analysis part.

Bivariate Analysis

The strongest correlation is between alcohol & quality, which can be clearly seen in the boxplot: quality ascends with increasing alcohol level. The following variables also correlate with wine quality (in descending order of strength): volatile acidity, sulphates, citric acid, density and fixed acidity.

There are two main observations when investigating the correlation between those variables:

  1. Alcohol and density correlate negatively: the higher the alcohol, the lower the density. This is easily explained: alcohol is less dense than water, so a wine that contains relatively more alcohol has a lower density.

  2. Fixed acidity and density correlate positively: the higher the fixed acidity, the higher the density. This was surprising at first, as acidity does not correlate with the alcohol amount, which itself correlates with the density. After a little research, it seems the acidity has a chemical effect on the density, hence the correlation.

Multivariate Plots Section

Multivariate Analysis

The scatterplots examine the 6 variables we identified as correlating with the quality of wines. To reduce the clutter, they are faceted by rating. The volatile acidity appears to be rather low for wines of good quality, no matter the amount of alcohol contained. The sulphates vs. alcohol plot also shows wines of good quality concentrated in a smaller area around 11.8% alcohol and sulphate levels around 0.75. For citric acid there is a very interesting gap in good wines at 0.25. Some of the wines that contain less were given a good quality rating by the judges, as were most of those that contain more. All the wines that contain between 0.19 and 0.25 of citric acid are considered average or bad.


Final Plots and Summary

Plot 1: Effect of acids on wine quality

Effect of acids on wine quality

The boxplots show a very clear trend for citric acid as well as volatile acidity on the quality of red wine. Lower volatile acidity and higher citric acid are associated with better wine quality. The two forms of acid seem to cancel each other out: both influence the fixed acidity and the pH value of the wine, neither of which shows a clear trend that could be linked to wine quality.

Plot 2: Effect of Alcohol on Wine Quality

The effect of Alcohol on Wine Quality

These boxplots clearly show the effect of alcohol content on the quality of a wine. Even though there are outliers in the group of wines rated ‘OK’, in general a higher amount of alcohol is an indicator of a wine of good quality.

Plot 3: What makes good wines, good, and bad wines, bad?

In this graph only the very good and bad wines are considered, because the same trend was less clear when wines of all ratings were included. It summarizes the strongest findings: a wine of good quality depends on a low volatile acidity and a high alcohol content.


Reflection

The analysis of wine quality identified 6 different variables that correlate with red wine quality: alcohol, volatile acidity, sulphates, citric acid, density and fixed acidity. The alcohol content and the volatile acidity are the variables with the strongest impact on red wine quality. Nonetheless, it is important to keep in mind that this dataset contains only wines of a certain region, and there could be regional differences that change the impact of e.g. volatile acidity on red wines from Portugal compared to Australia. As volatile acidity is hard to measure when buying a red wine, it might be worth going for the red wine with the higher alcohol content the next time when in doubt about which red wine to buy.